1.  SUMMARY

In  this  Quarterly  Progress  Report,  we  present  our  work 
performed  during  the  period  December  8,  1979  to  March  7,  1980. 

1.1  Introduction 

The  work  in  the  last  quarter  was  in  the  areas  of  natural 
phonetic  synthesis,  phonetic  recognition,  and  multirate  speech 
compression.  The  recognition  and  synthesis  programs  will  operate 
together  as  a  very  low  rate  phonetic  vocoder.  Below  is  a  summary 
of  the  accomplishments  in  each  of  the  three  areas.  Details  are 
presented  in  Sections  2,  3  and  4. 

1.2  Phonetic  Synthesis 

The  exhaustive  testing  of  the  diphone  templates  has  been 
proceeding  according  to  schedule.  We  have  also  added  50  new 
diphones  to  allow  more  natural  synthesis  of  flapped  /t/.  Rules 
were  added  to  the  synthesis  program  that  allow  the  insertion  of 
glottal  stops  and  glottal  onsets.

In  order  to  use  the  diphone  synthesis  program  as  the  final 
stage  of  the  MIT  text-to-speech  system,  an  interface  program  was 
written  to  convert  the  intermediate  phonetic  output  of  the  MIT 
text-to-phone  program  into  a  form  suitable  for  the  diphone
synthesis  program.  This  will  also  allow  the  comparison  of  the
phonetic  synthesis-by-rule  program  with  the  diphone  synthesis
program.

1.3  Phonetic  Recognition 

During  this  quarter,  improvements  were  made  to  the  phonetic
recognition  programs  that  affect  their  performance,  efficiency,
and  interactive  capabilities.

We  added  statistical  programs  to  permit  the  continuing 
refinement  of  the  mean  and  standard  deviation  parameters  now 
stored  at  every  network  node.  We  added  training  code  which 
permitted  us  to  take  statistics  based  on  the  result  of  a  match  or 
to  add  new  paths  or  path  segments  to  the  network  from  the  current 
input  frames.  We  generalized  the  matching  algorithm  so  that 
theory  merging  could  be  done  at  every  network  node.  We  made  it 
easier  for  the  user  to  realize  the  alignment  desired  by  extending 
the  forced  alignment  to  include  time  constraints.  All  program 
control,  as  well  as  the  setting  of  many  control  parameters,  was 
made  directly  accessible  via  a  thoroughly  interactive  command 
loop. 


Our  future  work  is  discussed  as  it  appears  in  light  of  these
recent  improvements.

2.  SYNTHESIS


The  major  effort  this  quarter  in  synthesis  has  been  directed 
toward  exhaustive  testing  of  the  diphone  data  base.  We  have  also 
recorded  and  digitized  another  50  diphones  to  account  for  flapped 
/t/  [DX]  before  and  after  each  vowel. 

Two  substantial  programming  additions  were  made.  The  first 
allows  the  synthesis  of  glottal  stops  and  glottal  onsets.  The 
second  was  an  interface  between  the  MIT  text-to-speech  system  and 
the  BBN  diphone  synthesis  program. 

The  data  base  additions  are  described  in  Section  2.1, 
diphone  testing  is  discussed  in  Section  2.2,  and  programming 
changes  are  outlined  in  Section  2.3. 


2.1  Diphone  Data  Base


Now  that  the  initial  transcription  (labelling)  of  the 
diphone  data  base  is  complete  (see  QPR  No.  6),  we  have  turned  our 
attention  to  the  testing  and  "tuning"  of  the  data  base.  This 
testing  (described  below)  has  resulted  in  several  changes  to  the 
structure  of  the  data  base. 


We  have,  first  of  all,  determined  that  we  need  to  include 
the  phoneme  "flap"  (as  in  "butter")  in  all  possible  vowel
contexts.  We  are  currently  extending  the  data  base  to  include 


2.3.1  Glottal  Stops  by  Rule 


Based  on  acoustic  evidence  from  Sorensen  and  Cooper  [1],  as
well  as  our  own  informal  observations,  we  have  implemented  a  rule
for  synthesizing  glottal  stops.  A  glottal  stop  manifests  itself

3.  PHONETIC  RECOGNITION


During  this  past  quarter  many  additions  and  improvements 
were  made  to  the  phonetic  recognition  programs.  These  are 
discussed  in  Section  3.1.  The  anticipated  direction  of  future 
work  is  discussed  in  Section  3.2. 


3.1  Phonetic  Recognition  Programs 


In  this  section  the  additions  and  improvements  to  the 
phonetic  recognition  programs  are  discussed.  There  are  two 
fundamentally  distinct  programs  which  we  will  talk  about  in  this 
section.  One  is  the  compiler,  which  takes  a  text  file  of  diphone 
descriptions  (in  essentially  the  same  format  as  the  input  used  by 
the  synthesis  compiler)  and  produces  a  network.  This  compiler 
has  the  capability  to  make  an  incremental  addition  to  an  already 
existent  compiled  network.  It  is  this  second  capability  which 
will  permit  us  to  train  the  network  and  later  add  entirely  new 
diphones  (if  necessary)  but  retain  the  training  on  the  original 
network.  The  second  program  is  the  matcher.  This  is  the  program 
that  actually  does  the  phonetic  recognition.  It  reads  in  a 
compiled  network  (trained  or  untrained)  and  uses  its  paths  when 
trying  to  determine  the  best  diphone  sequence.  The  matcher 
permits  the  user  to  specify  a  desired  result  (in  terms  of  a 
sequence  of  phonemes  and  optional  time  limits)  and  to  train  on  an
actual  alignment  (correspondence  between  network  paths  and  input
frames)  once  the  match  has  been  completed.  This  training
consists  of  collecting  statistics  (which  can  affect  both  the
spectral  and  duration  scoring)  or  adding  new  paths  to  the  current
network.

Each  of  the  major  changes  that  have  been  made  is
described.  The  motivation  for  each  change  is  presented  and  the
programming  necessary  to  implement  it  is  mentioned.

3.1.1  Statistical  Scoring  (Matcher  and  Compiler) 

Perhaps  the  most  important  change  was  to  generalize  the 
spectral  scoring  capability.  As  of  last  quarter  the  spectral 
score  used  was  the  weighted  Euclidean  distance  between  a  network 
node  and  its  corresponding  input  frame  whose  spectral 
"coordinates"  are  Log  Area  Ratios.  The  weighting  was  independent 
of  the  particular  network  node,  however.  Therefore,  in  order  to
make  this  weighting  dependent  on  the  particular  network  node 
being  scored,  weighting  information  was  added  to  each  network 
node,  in  the  form  of  standard  deviations.  This  change  required  a 
modification  to  both  the  compiler  and  the  matcher.  It  required  a 
modification  to  the  compiler  because  the  format  of  the  spectral 
information  stored  at  each  network  node  had  to  be  extended  to 
include  room  for  the  standard  deviation  of  each  parameter.  (This 
extended  spectral  information  is  equivalent  to  specifying  a

or  more  partial  theories  arrive  at  the  same  network  node  at  the
same  time,  i.e.,  each  theory  is  attempting  to  score  this  node
with  the  same  input  frame,  only  the  highest  scoring  theory  needs
to  be  kept.  Originally  it  was  felt  that  merging  theories  at 
diphone  boundaries  would  be  satisfactory.  The  matcher  seemed  to 
function  well  and  did  not  generally  "discover"  better  scoring 
paths  if  the  theory  stack  size  were  doubled  or  tripled.  When  the 
time  constraints  were  included  as  part  of  the  forced  alignment 
code  we  noticed  that  frequently  the  entire  theory  queue  would  be 
depleted  and  the  matcher  would  fail  to  find  a  match  that  was 
consistent  with  the  requested  alignment  constraints.  The  history 
of  the  matching  process  was  carefully  examined  by  looking  at  the 
theories  on  the  stack  at  each  point  in  the  input.  Although 
theory  merging  was  being  done  properly  at  each  of  the  diphone 
boundaries,  the  number  of  theories  actually  on  the  theory  stack 
was  several  times  as  large  as  it  would  have  been  if  theory 
merging  had  been  done  at  every  single  node  in  the  network.  While 
simply  increasing  the  stack  size  would  have  given  us  the  result 
we  desired  (an  alignment  consistent  with  the  time  constraints)  we 
recognized  that  it  was  time  for  the  more  general  theory  merging 
capabilities  to  be  added. 
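
The merging rule described above can be sketched roughly as follows. This is only an illustrative sketch, not the matcher's actual code: the `Theory` record, its field names, and the function `merge_theories` are hypothetical, and it assumes that a higher score is better and that two theories are "comparable" exactly when they are at the same network node and about to score the same input frame.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Theory:
    """A partial theory (hypothetical layout, for illustration only)."""
    node: int                    # network node the theory has reached
    frame: int                   # index of the input frame being scored
    score: float                 # accumulated score; higher is assumed better
    path: Tuple[int, ...] = ()   # nodes visited so far


def merge_theories(theories: List[Theory]) -> List[Theory]:
    """Keep only the best theory for each (node, frame) pair.

    Theories that reach the same node with the same input frame share
    every possible continuation, so only the highest scoring one can
    lead to the best complete match.
    """
    best: Dict[Tuple[int, int], Theory] = {}
    for theory in theories:
        key = (theory.node, theory.frame)
        kept = best.get(key)
        if kept is None or theory.score > kept.score:
            best[key] = theory
    # Return the survivors as a stack ordered best-first.
    return sorted(best.values(), key=lambda t: t.score, reverse=True)
```

Applying such a merge at every node rather than only at diphone boundaries removes duplicate theories earlier, which is why a smaller stack can give equivalent results.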

This  change  affected  both  the  compiler  and  the  matcher  since 
a  change  in  the  network  was  necessary.  Rather  than  describing 
the  structure  of  the  network  it  will  only  be  noted  here  that  this
change  requires  more  memory  per  network  node  than  was  previously
required.  In  order  to  take  advantage  of  this  extended  network
structure,  the  matcher  had  to  be  changed  to  detect  and  merge 
"comparable"  theories. 

This  change  was  a  very  significant  one  for  two 
reasons:  First,  the  new  strategy  for  theory  merging  permitted  a 
much  smaller  theory  stack  and  faster  operation  for  equivalent 
performance.  Second,  the  additional  memory  requirements  made  it
very  clear  that  we  were  about  to  run  out  of  memory  space. 

This  precipitated  yet  another  round  of  changes  in  order  to 
reduce  the  memory  space  required  by  the  compiled  network.  It 
appears  that  a  "reasonable"  amount  of  training  and  network 
modification  will  now  be  possible  before  we  run  out  of  memory 
space  again. 

3.1.4  Training 

Given  an  alignment,  that  is,  a  correspondence  between  input 
frames  and  network  nodes,  the  user  can  train  the  network  in  many 
ways:

a)  Update  the  Statistics  at  each  network  node. 

The  user  is  permitted  to  collect  statistics  on  any  region  of 
an  alignment  that  results  from  a  match.  Once  a  match  is  made  and
the  resulting  alignment  printed,  the  correspondence  between
frames  in  the  input  and  nodes  in  the  network  can  be  seen.  The 
user  specifies  that  statistics  are  to  be  collected  over  a  certain 
region  of  the  input.  The  correspondence  between  input  frames  and 
network  nodes  is  used  to  determine  which  network  nodes  are  to  be 
changed.  Specifically,  the  spectral  statistics  (means  and 
variances  of  each  parameter)  of  a  network  node  are  updated 
whenever  an  input  frame  is  specified  to  which  it  corresponds  in 
the  matched  alignment.  When  several  consecutive  input  frames 
have  all  been  aligned  with  the  same  network  node,  each  of  them  is 
used  as  a  single  sample  to  update  the  network  spectral 
statistics.

The  duration  statistics  of  each  network  node  used  in  the
matched  alignment  are  updated  by  one  sample  (for  each  distinct  time
the  node  is  used  in  the  statistics  command);  the  sample  is  the
number  of  consecutive  input  frames  to  which  the  node  is  matched  in
this  alignment.

It  is  important  to  note  that  statistics  can  only  be  taken  if 
a  match  has  already  been  made  and  that  the  portion  of  alignment 
specified  in  the  statistics  command  is  sufficient  to  completely 
determine  what  network  nodes  are  to  be  updated. 
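
A minimal sketch of this bookkeeping is given below. It is not the matcher's code: the class `NodeStats`, the function `collect_statistics`, and the representation of an alignment as a list mapping each frame index to a node index are all assumptions made for illustration. The running mean and variance use Welford's incremental method and the sample (n-1) variance, which are choices of this sketch; the report does not say how the statistics are accumulated. The sketch also assumes the selected region covers whole runs of frames, so that each duration sample is complete.

```python
from typing import Dict, List, Optional, Sequence


class NodeStats:
    """Running spectral and duration statistics for one network node."""

    def __init__(self, n_params: int) -> None:
        self.count = 0
        self.mean = [0.0] * n_params
        self._m2 = [0.0] * n_params      # sum of squared deviations
        self.durations: List[int] = []   # one sample per use of the node

    def add_frame(self, frame: Sequence[float]) -> None:
        """Use one aligned input frame as a single spectral sample."""
        self.count += 1
        for i, x in enumerate(frame):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self._m2[i] += delta * (x - self.mean[i])

    def variance(self) -> List[float]:
        if self.count < 2:
            return [0.0] * len(self.mean)
        return [m2 / (self.count - 1) for m2 in self._m2]


def collect_statistics(alignment: Sequence[int],
                       frames: Sequence[Sequence[float]],
                       stats: Dict[int, NodeStats],
                       first: int, last: int) -> None:
    """Update node statistics over input frames first..last (inclusive).

    alignment[t] is the node matched to input frame t.
    """
    run_node: Optional[int] = None
    run_len = 0
    for t in range(first, last + 1):
        node = alignment[t]
        stats[node].add_frame(frames[t])      # every frame is a sample
        if node == run_node:
            run_len += 1
        else:
            if run_node is not None:
                stats[run_node].durations.append(run_len)
            run_node, run_len = node, 1
    if run_node is not None:
        stats[run_node].durations.append(run_len)
```

Each run of consecutive frames aligned with one node contributes as many spectral samples as it has frames, but only one duration sample, as the text above describes.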

for  each  node.  Thus,  whenever  the  network  diphone  contained  more
nodes  than  there  were  frames  in  the  corresponding  region  of  the

added  to  the  network  in  this  way,  the  fact  that  the  paths  were
taken  from  the  input  and  the  use  of  the  VFR  algorithm  frequently 
permitted  subsequent  matches  to  match  the  global  alignment  much 
more  accurately.  The  problem  with  this  procedure,  however,  was 
that  it  was  very  tedious,  since  several  matches  would  sometimes 
be  required  before  the  desired  alignment  could  be  realized. 

It  was  primarily  in  response  to  this  problem,  the 
tediousness  of  training,  that  the  capabilities  of  the  matcher 
were  changed  so  as  to  automatically  permit  a  single  input  frame 
to  simultaneously  match  two  consecutive  network  nodes.  Although 
this  addition  violates  the  spirit  of  the  scoring  philosophy  as 
presented  in  the  last  QPR,  it  was  added  with  the  understanding 
that  alignments  that  do  not  require  its  use  are  to  be  preferred 
to  those  that  do.  This  additional  matching  capability  was  added 
on  a  flag  and  is  going  to  be  used  primarily  to  enable  much 
quicker  attainment  of  the  desired  alignments.  Once  the  desired 
alignment  has  been  established,  the  training  capabilities  of  the 
matcher  can  alter  the  network  with  new  paths  that  do  not  require 
such  a  skipping  of  network  nodes.  Even  when  the  flag  is  on  and 
such  a  match  is  made,  network  frames  are  not  really  skipped  as 
such,  since  both  frames  are  scored  against  the  same  input  frame. 

It  is  possible  that  we  will  find  it  convenient  to  permit 
this  kind  of  matching  routinely,  since  the  current  indications

3.2.1  Ongoing  Debug  and  Program  Improvements

First,  we  will  have  to  continue  the  process  of  tracking  down 
each  newly  discovered  bug  as  it  becomes  apparent  during  the 
ongoing  testing  and  training  of  the  recognition  programs.  At 
this  time  it  is  very  difficult  to  see  what  improvements  will  be 
necessary  since  practically  all  of  the  original  ideas  have  now 
been  implemented  and  the  training  is  just  beginning.  From  our 
past  experience  we  do  anticipate  that  some  changes  will  be 
necessary  although  we  hope  that  most  of  the  remaining  time  can  be 
spent  on  the  training  of  the  network.

3.2.2  Training 

Since  the  training  of  the  network  is  the  most  important 
remaining  part  of  the  work  to  be  done  (assuming  that  no  new  code 
needs  to  be  written)  we  expect  to  spend  a  great  deal  of  time 
during  the  next  quarter  running  the  system  on  sentences  on  which 
we  want  to  train. 

Currently  we  intend  to  train  some  subset  of  the  network 
first  (rather  than  to  try  to  get  samples  of  every  diphone)  to 
investigate  its  behavior  as  a  system.  By  first  training  on  those 
diphones  which  occur  most  frequently  in  natural  English  we  expect 
to  improve  the  overall  system  performance  rapidly.  Our 
subsequent  training  on  less  frequent  diphones  is  also  expected  to

bits  starting  at  the  end  of  the  block,  thereby  discarding  bits
that  represent  DCT  components.  The  codes  representing  the  DCT
components  are  arranged  in  a  certain  order  prior  to  transmission. 
This  ordering  determines  which  bits  get  discarded  first.  To 
study  the  tradeoff  between  the  number  of  transmitted  bits,  the 
quantization  accuracy,  and  the  number  of  received  frequency 
components,  we  investigated  three  ordering  techniques.  In  all 
three  techniques,  to  be  described  below,  we  assume  that  the 
receiver  decodes  the  system  parameters  and  performs  the  bit 
allocation  as  was  done  at  the  transmitter.  This  is  standard 
practice  in  adaptive  transform  coding.  In  addition,  we  require 
that  the  receiver  know  how  many  bits  are  received  each  frame  so 
that  it  knows  where  the  next  frame  begins.  This  last  piece  of 
information  is  passed  along  by  the  channel  itself. 

The  first  bit-ordering  technique  we  investigated  is  the 
simplest:  the  codes  are  arranged  by  order  of  increasing 
frequency.  When  the  channel  strips  off  bits  from  the  end  of  each 
block,  the  high-frequency  components  are  discarded  first.  The 
remaining  codes  represent  a  low  frequency  portion  of  the  total 
bandwidth  referred  to  as  a  baseband.  As  in  a  baseband  coder,  the 
receiver  regenerates  the  missing  high-frequency  components.  We 
use  the  method  of  high-frequency  regeneration  (HFR)  by  spectral 
duplication,  which  is  explained  in  Section  4.4. 
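
The first ordering can be illustrated with the sketch below. It is a simplified model, not the coder itself: the function names are hypothetical, a block is represented as a string of '0' and '1' characters, every component is assumed to have at least one bit allocated, and a code cut in half by the channel is assumed to be discarded.

```python
from typing import List


def pack_block(codes: List[int], bits: List[int]) -> str:
    """Concatenate the DCT codes in order of increasing frequency.

    codes[k] is the quantizer code of component k and bits[k] its
    bit allocation, which the receiver is assumed to recompute.
    """
    if any(b <= 0 for b in bits):
        raise ValueError("this sketch assumes at least one bit per component")
    return "".join(format(c, "0{}b".format(b)) for c, b in zip(codes, bits))


def strip_bits(block: str, n_kept: int) -> str:
    """Model the channel: keep the first n_kept bits, drop the rest."""
    return block[:n_kept]


def unpack_block(received: str, bits: List[int]) -> List[int]:
    """Decode the complete codes that survived; these form the baseband."""
    decoded: List[int] = []
    pos = 0
    for b in bits:
        if pos + b > len(received):
            break                      # this and all higher components are lost
        decoded.append(int(received[pos:pos + b], 2))
        pos += b
    return decoded
```

Because the codes are in frequency order, the number of decoded components directly gives the width of the received baseband for that frame.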

present  a  frequency-domain  system  [3]  that  performs  HFR  by
spectral  translation  of  the  baseband.  The  principle  of  the 
method  is  based  on  the  fact  that,  in  the  adaptive  transform 
baseband  coder,  the  baseband  DCT  components  can  be  easily 
duplicated  at  higher  frequencies  to  obtain  the  fullband 
excitation  signal.  Our  present  HFR  method  aims  at  duplicating, 
as  closely  as  possible,  the  original  fullband  DCT  of  the 
residual,  while  accommodating  the  variable  baseband  width  aspect 
of  the  present  multirate  system.  We  now  explain  the  method  with 
the  help  of  Fig.  4.2.  The  transmitter  assumes  a  (fixed)  nominal 
baseband  width  of  1000  Hz.  Thus,  the  simplest  spectral 
translation  method  would  be  to  duplicate  the  region  from  0  to 
1000  Hz  onto  the  regions  from  1000  to  2000  Hz  and  from  2000  to 
3000  Hz.  In  addition,  to  lock  the  high-frequency  interval  into 
place,  by  exploiting  the  quasi-harmonic  structure  of  the  speech 
spectrum,  we  shift  the  baseband  around  its  nominal  position  and 
correlate  it  with  the  corresponding  original  DCT  components  that 
are  in  the  region  1000  to  2000  Hz.  The  cross-correlation  is  done 
at  the  transmitter  where  the  original  DCT  of  the  fullband 
residual  is  available.  The  same  process  is  repeated  for  the  next 
frequency  band.  Short  lags  from  -3  to  +4  spectral  points  are 
considered.  (The  total  bandwidth  is  128  points.)  The  optimal 
location  is  then  chosen  to  be  at  the  positive  maximum  value  of 
the  cross-correlation.  Thus,  we  require  an  additional  3  bits  of
side  information  for  each  of  the  two  high-frequency  bands.  The
additional  6  bits  are  transmitted  along  with  the  system
parameters  (HFR  codes).

Fig.  4.2  High  frequency  regeneration  by  spectral  duplication
   A.  At  the  transmitter:  1.  Choose  point  of  maximum  correlation.  2.  Transmit  3-bit  HFR  codes.
   B.  At  the  receiver:  1.  Translate  received  baseband.  2.  Fill  gaps.

At  the  receiver,  the  decoded  baseband  is  translated  up, 
starting  at  1000  Hz  (and  2000  Hz  for  the  next  band)  and  is  moved
further  by  a  small  amount  as  indicated  by  the  3-bit  HFR  code. 
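
The basic lag search and translation can be sketched as follows. This is an illustration under simplifying assumptions, not the simulated system: the function names are hypothetical, spectra are plain lists of DCT values, the baseband has its nominal width, the 3-bit code is assumed to be the lag offset by 3, and none of the practical refinements to the simple algorithm are modeled.

```python
from typing import List, Sequence

# Lags of -3..+4 spectral points: eight candidates, hence a 3-bit code.
LAGS = range(-3, 5)


def choose_hfr_code(baseband: Sequence[float],
                    fullband: Sequence[float],
                    start: int) -> int:
    """Transmitter side: pick the lag with maximum cross-correlation.

    `start` is the nominal index at which the baseband copy begins
    in the high-frequency band being regenerated.
    """
    best_code = 0
    best_corr = float("-inf")
    for code, lag in enumerate(LAGS):
        corr = 0.0
        for k, b in enumerate(baseband):
            j = start + lag + k
            if 0 <= j < len(fullband):
                corr += b * fullband[j]
        if corr > best_corr:
            best_code, best_corr = code, corr
    return best_code


def translate_baseband(baseband: Sequence[float],
                       spectrum: List[float],
                       start: int, code: int) -> None:
    """Receiver side: copy the baseband up to `start`, shifted by the lag."""
    lag = LAGS[code]
    for k, b in enumerate(baseband):
        j = start + lag + k
        # Never overwrite the received baseband itself.
        if len(baseband) <= j < len(spectrum):
            spectrum[j] = b
```

The search needs the original fullband DCT, so it can only run at the transmitter; the receiver simply applies the lag it is sent.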

In  practice,  there  are  three  deviations  from  the  simple 
algorithm  described  above.  The  first  is  that  the  first  few  DCT
components,  starting  at  d.c.  and  up  to  half  the  pitch  frequency,
are  not  duplicated  onto  the  high-frequency  bands,  nor  are  they
considered  in  the  correlation  method  described  above.  Second,  we
found  that  spectral  flattening  of  the  baseband  at  the  receiver
prior  to  HFR  improves  the  speech  quality.  The  third  deviation  from
the  algorithm  is  due  to  the  fact  that  the  received  baseband  width
is  seldom  equal  to  1000  Hz.  In  fact,  it  varies  from  frame  to
frame,  because  the  received  number  of  bits  (and  therefore  number
of  DCT  components)  is  set  by  the  channel.  We  have  devised
certain  modifications  to  the  method  to  deal  with  that  problem 
appropriately.  For  example,  following  the  HFR  process  described 
above,  there  can  be  gaps  in  the  regenerated  fullband  DCT,  as 
illustrated  in  Fig.  4.2.  To  fill  such  gaps,  we  translate  the 
received  baseband  in  such  a  manner  that  its  center  coincides  with 
the  middle  of  the  gap.  We  then  shift  the  baseband  and  correlate

we  coded  the  fullband  DCT  at  an  average  of  2,  2.25,
2.5,  and  3  bits  per  sample.  In  our  experiments,  there 
was  no  maximum  limit  on  the  bit  rate  for  the  fullband 
case,  although,  in  practice,  the  system  will  be 
operated  at  9.6  kb/s  or  below. 

b.  Noise  Shaping.  We  used  various  values  of  the  noise 
shaping  parameter  y  [see  previous  QPR]  ranging  between 
0  and  1.  For  y  closer  to  1,  the  available  bits  are 
spread  more  evenly  in  the  frequency  range,  resulting  in 
a  larger  number  of  received  DCT  components  at  a  given 
bit  rate,  at  the  expense  of  coarser  quantization  in  the 
low-frequency  region  for  voiced  sounds. 

c.  Embedded  Coding.  We  simulated  the  three  bit  ordering 
techniques  described  in  Section  4.3  and  evaluated 
informally  the  speech  quality  obtained  with  each.  For 
each  technique,  we  optimized  the  system  in  terms  of  the 
total  fullband  rate  and  the  value  of  y  as  in  (a)  and 
(b)  above. 

d.  High-Frequency  Regeneration.  As  described  in  Section 
4.4,  the  transmitter  assumes  a  nominal  baseband  width 
for  which  it  computes  the  HFR  codes.  We  investigated 
the  use  of  3-bit  and  2-bit  HFR  codes,  and  the  use  of  a 
nominal  baseband  width  of  800,  900  and  1000  Hz.

For  each  choice  of  the  above  described  parameter  settings  we
tested  the  system  at  9.6  kb/s,  7.2  kb/s  and  6.4  kb/s,  although, 
in  principle,  the  data  rate  can  be  set  by  the  channel  to  an 
arbitrary  value.  All  tests  were  done  with  5  male  and  5  female 
sentences.  Our  present  choice  of  a  good  compromise  system 
operating  uniformly  well  over  the  desired  data  rates  is  one  where 
we  code  the  fullband  DCT  of  the  residual  at  an  average  of  1.95 
bits  per  sample,  i.e.,  where  the  maximum  bit  rate  is  16  kb/s. 
The  value  of  y  is  0.9,  and  the  embedded  coding  technique  is  the
first  (baseband  coding),  with  a  nominal  baseband  width  of  1000  Hz
and  3-bit  HFR  codes.  The  data  rates  of  9.6,  7.2  and  6.4  kb/s  are 
achieved  by  keeping  at  each  frame  124,  80,  and  65  bits, 
respectively,  out  of  the  maximum  total  of  250  bits.  The  average 
width  of  the  received  baseband  for  the  three  cases  is  1400,  870, 
and  670  Hz,  respectively. 


At  present,  we  feel  that  the  above  described  system  is 
providing  us  with  very  good  speech  quality  at  9.6  kb/s,  good
speech  quality  at  7.2  kb/s,  and  reasonable  quality  at  6.4  kb/s. 
The  problem  at  bit  rates  of  6.4  kb/s  or  below  is  that  the 
received  baseband  becomes  too  narrow,  which  results  in 
appreciable  roughness  in  the  coded  speech  and  some  "thuds."  Also
noticeable  at  6.4  kb/s,  especially  for  female  voices,  is  the 
reverberant  quality  of  the  reconstructed  speech.